Install packages & load libraries necessary.
## Warning: package 'BiocGenerics' was built under R version 4.0.3
## Warning: package 'IRanges' was built under R version 4.0.3
## Warning: package 'S4Vectors' was built under R version 4.0.3
## Warning: package 'circlize' was built under R version 4.0.3
## Warning: package 'kableExtra' was built under R version 4.0.3
## Warning: package 'stringr' was built under R version 4.0.3
The Expression Dataset I chose was GSE97356: Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study.
A bit about the dataset
gse <- getGEO("GSE97356",GSEMatrix=FALSE)
Dataset details:
Dataset platform details:
Name of Platform: Illumina HiSeq 2000 (Homo sapiens)
Submission Data Date: Nov 02 2010
Last Date Data was Updated: Mar 27 2019
Organisms Included in Platform: Homo sapiens
Quantity of GEO datasets that use this Platform: 8531
Quantity of GEO samples that use this Platform: 140818
Clean the Data and Map to HUGO Symbols
if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
con <- dbConnect(SQLite(),'GEOmetadb.sqlite')
sfiles <- getGEOSuppFiles('GSE97356')
fnames = rownames(sfiles)
reads = read.delim(fnames[1],header=TRUE,sep = ",")
#rename first column to genes, as it's more meaningful than X
colnames(reads)[1] <- c("Gene")
The dimensions of my dataset are: 25830, 325
This means there are 25830 rows and 325 columns (the rows indicate genes and the columns are samples, ignoring the first column which delimits that the rows are genes).
I have three kinds of samples in my 324 subjects, based on the study design. The study design says “324 human samples; 201 never, 82 current, 41 past with PTSD”. I want to get a list of which subjects have never/current/past PTSD diagnosis to compare to my reads dataframe.
for(i in 1:324) {
status <- gse@gsms[[i]]@header["characteristics_ch1"][[1]][1]
case <- gse@gsms[[i]]@header["title"]
if(i==1){
metainfo <- data.frame("status"= status, "case" = case)
}
else{
currentMetainfo <- data.frame("status"= status, "case" = case)
metainfo <- rbind(metainfo, currentMetainfo)
}
}
#for a sanity check ensure correct number of subjects with current/never/past PTSD diagnosis
occurences<-table(unlist(metainfo$status))
print(occurences)
##
## ptsd: Current ptsd: Never ptsd: Past
## 81 201 42
The metainformation found in the gse shows 81 subjects with current PTSD, 201 subjects who never had PTSD, and 42 who had past PTSD diagnosis. This doesn’t match the study design… but I looked at the paper and the published paper shows: “201 with never, 81 current, 42 past PTSD”, so we are actually correctly parsing the data, the study design is incorrectly labelled on GEO.
Filter weakly expressed features from my dataset
I’m going to filter lowly expressed genes using edgeR. Since I am keeping the genes that have at least 1 count per million within smallest group sample size, which is 42 (past PTSD), I’m looking for genes that have at least 1 count per million at least 42 times.
cpms = cpm(reads[,2:325])
#add in the genes associated with rownames
rownames(cpms) <- reads[,1]
qualityReads = rowSums(cpms >1) >= 42
filteredReads = reads[qualityReads,]
The filtered dimensions of the dataset now are: 14184, 325.
During filtering we removed 11646 genes.
Checking for duplicates after filtering
I’m going to see if the dataset has duplicates:
filteredReadsDf <- data.frame(table(filteredReads$Gene))
colnames(filteredReadsDf) <- c("Gene", "GeneOccurence")
duplicates <- filteredReadsDf %>% filter(GeneOccurence > 1)
It looks like I don’t have any duplicated gene names, the number of duplicated gene names is 0 .
Let’s take a look at how the gene names are stored:
head(filteredReadsDf$Gene, 5)
## [1] A1BG A1BG-AS1 A2M A2M-AS1 A2MP1
## 14184 Levels: A1BG A1BG-AS1 A2M A2M-AS1 A2MP1 A3GALT2 AAAS AACS ... ZZZ3
Looking at gene AADACL2-AS1 for example, I see it is AADACL2 Antisense RNA 1. These appear to be gene ids, but I want to convert to HUGO symbols. I’m going to see if I can find the corresponding ensembl ids.
Mapping to Ensembl Ids
grch37 <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice") #from https://support.bioconductor.org/p/62064/
ensembl_grch37 = useDataset("hsapiens_gene_ensembl",mart=grch37)
idToEnsembl <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
filters = c("wikigene_name"),
values = filteredReads$Gene,
mart = ensembl_grch37)
I can see that I have 0 gene ids that do not map to ensembl ids, so I’m not concerned. I’m now going to convert the ensembl ids to HUGO symbols.
Mapping to HUGO symbols
#first add the ensembl id labels to filteredReads
colnames(idToEnsembl) <- c("Gene", "EnsemblId")
filteredReads <- merge(idToEnsembl, filteredReads, by="Gene")
#now lets get the corresponding HUGO symbols
ensemblToHUGO <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
filters = c("ensembl_gene_id"),
values = filteredReads$EnsemblId,
mart = ensembl_grch37)
Now I have 413 ensembl ids that do not map to HUGO symbols. Lets see how many ensembl ids are duplicated.
Checking for duplicates in Ensembl Ids
filteredReadsDfEnsembl <- data.frame(table(filteredReads$EnsemblId))
colnames(filteredReadsDfEnsembl) <- c("EnsemblGene", "GeneOccurence")
duplicates <- filteredReadsDfEnsembl %>% filter(GeneOccurence > 1)
length(duplicates$EnsemblGene)
## [1] 549
I have 549 duplicated ensembl ids. I wonder how large the overlap between the gene names that are duplicated and those that do not have HUGO symbols is.
#first add the HUGO symbols to to filteredReads
colnames(ensemblToHUGO) <- c("EnsemblId", "HUGOSymbol")
filteredReads <- merge(ensemblToHUGO, filteredReads, by="EnsemblId")
emptyHUGO <- filteredReads %>% filter(HUGOSymbol == "")
duplicatedEnsembl <- data.frame(duplicates$EnsemblGene)
colnames(duplicatedEnsembl) <- c("EnsemblId")
overlapMissingDuplicated <- merge(emptyHUGO, duplicatedEnsembl, by="EnsemblId")
length(overlapMissingDuplicated$EnsemblId)
## [1] 117
It looks like there are 117 gene names that are duplicated ensembl ids and do not have HUGO symbols. The gene names that are missing HUGO symbols make up 3.0299283 % of the dataset. As some of these, 21.3114754 % exactly, have no corresponding HUGO symbol, but all have unique wikigene names, I’m not sure whether these genes are being mapped correctly and don’t feel comfortable removing them.
Additionally, according to lecture 4 slides 26 “if we base our analysis on ensembl gene ids then they are unique elements” as a comment not to necessarily filter out duplicates, and as I have removed lowly expressed genes I feel this is enough at this stage. Instead I’m going to try and preserve all these genes with duplicated names and just create versions of them so theyre unique. This is because the end goal of this assignment is to produce a dataframe with 324 numeric columns where each row has unique HUGO symbols as rownames. Lets see how many HUGO symbols are duplicated:
filteredReadsDfHUGO <- data.frame(table(filteredReads$HUGOSymbol))
colnames(filteredReadsDfHUGO) <- c("HUGOGene", "GeneOccurence")
HUGOduplicates <- filteredReadsDfHUGO %>% filter(GeneOccurence > 1)
length(HUGOduplicates$HUGOGene)
## [1] 1061
There are 512 more duplicated HUGO symbols than ensembl ids. This is a lot.
duplicatedHUGO <- data.frame(HUGOduplicates$HUGOGene)
colnames(duplicatedHUGO) <- c("HUGOSymbol")
duplicatedHUGOFull <- merge(duplicatedHUGO, filteredReads, by="HUGOSymbol")
duplicatedHUGOEnsembl <- merge(duplicatedHUGOFull, duplicatedEnsembl)
length(unique(duplicatedHUGOEnsembl$EnsemblId))
## [1] 549
This shows at least that all the duplicated ensembl ids also have duplicated HUGO symbols, so I’m really contending with the 512 duplicated HUGO symbol genes. I’m going to make the namings of these symbols unique so I can keep them all as discussed above. First I’m going to remove the ensembl ids that do not map to HUGO symbols, of which 117 are duplicated.
filteredReadsFinal <- filteredReads %>% filter(HUGOSymbol != "")
I lost 490 genes that didn’t have HUGO symbols. Now I’m going to make the namings of the HUGO symbols unique.
filteredReadsFinal <- filteredReadsFinal[, c(2,4:ncol(filteredReadsFinal))]
filteredReadsFinal$HUGOSymbol <- make.names(filteredReadsFinal$HUGOSymbol, unique=TRUE)
filteredReadsFinal <- tibble::column_to_rownames(filteredReadsFinal, var="HUGOSymbol")
Apply Normalization
Visualize the data
countDensity <- apply(log2(cpm(filteredReadsFinal)), 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(countDensity)) {
xlim <- range(c(xlim, countDensity[[i]]$x));
ylim <- range(c(ylim, countDensity[[i]]$y))
}
cols <- rainbow(length(countDensity))
ltys <- rep(1, length(countDensity))
p1 <- plot(countDensity[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
for (i in 1:length(countDensity)) lines(countDensity[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(dataToPlot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
We can see some data is not following the same overall pattern as the other samples. Let’s normalize.
Normalizing
Let’s see if we can understand what happened visually.
The boxplot is difficult to interpret, but you do see an adjustment towards the median especially between cases 181 to 223. The count density graph does also show a change in the dataset distribution after normalization. The patients are more tightly clustered, for one.
MDS Plot
plotMDS(d, labels=(metainfo$title), col = c("green","blue", "red")[factor(metainfo$status)])
Here the colour-coding is defined as follows: Blue are patients that have never had PTSD, red are patients that have been diagnosed in the past with PTSD, and green are patients that currently have PTSD.
Since MDS plot represents the distances between samples, we would hope to see past,never and current patients cluster together, but we don’t quite see this.
We want to to visualize our normalized data using a heatmap to get an idea of gene expression patterns.
I will take a closer look at the clustering of patients in my dataset.
Once again, it is clear to see that the patients are not clustering by the metrics evaluated in the study, i.e. there are not three separate clusters of samples corresponding to the PTSD presence/absence/past diagnosis categories. The majority of the patients are clustering around 0 in the leading logFC dim1 and dim2.
Another way to visualize the clustering is to colour by patient, but seeing as there aren’t two samples from one patient under differing conditions (i.e. before PTSD diagnosis and after, the following plot is not truly helpful in determining clustering patterns in the patients, definiteluy not as useful as if there were test and control conditions for each subject in the dataset.)
Limma Analysis
I’ll begin by creating a linear model using Limma (Ritchie et al., 2005), starting with a design matrix.
| (Intercept) | metainfo\(status Never </th> <th style="text-align:right;"> metainfo\)status Past | |
|---|---|---|
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
The next step is to create a data matrix and fit the data to the above model. This will be done using empirical Bayes to compute differential expression (note trend=TRU because this si RNAseq data). From this a table will be generated which contains p values and adjusted p values for gene expressions calculated.
| x | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|
| 11814 | SETD7 | -0.7443755 | 6.340630 | -3.745703 | 0.0002130 | 0.9900089 | -3.794917 |
| 2117 | CD44 | -7.8225238 | 122.273084 | -3.203193 | 0.0014948 | 0.9900089 | -4.021657 |
| 5030 | GPN3 | 0.1983226 | 1.170303 | 3.196877 | 0.0015269 | 0.9900089 | -4.024114 |
| 15347 | ZNF439 | 0.2208321 | 1.076629 | 3.188443 | 0.0015707 | 0.9900089 | -4.027389 |
| 11677 | SDHC | -0.3881468 | 5.171646 | -3.172328 | 0.0016577 | 0.9900089 | -4.033625 |
| 9003 | NT5C | 0.1702865 | 1.112761 | 3.132173 | 0.0018942 | 0.9900089 | -4.049041 |
| 2198 | CDK12 | -1.1712040 | 16.059256 | -3.115183 | 0.0020034 | 0.9900089 | -4.055511 |
| 8900 | NPIPA2 | -0.1202229 | 0.566841 | -3.050312 | 0.0024756 | 0.9900089 | -4.079927 |
| 8903 | NPIPA3.1 | -0.1202229 | 0.566841 | -3.050312 | 0.0024756 | 0.9900089 | -4.079927 |
| 8905 | NPIPA5 | -0.1202229 | 0.566841 | -3.050312 | 0.0024756 | 0.9900089 | -4.079927 |
## [1] 495
## [1] 0
Multiple Hypothesis Testing*
As there are multiple tests being done to check the significance of so many genes, we have to correct for multiple tests. This is due to the fact that the more tests are performed, the greater the likelihood of a false positive occurring by chance.
Lecture slides suggested that to minimize the false discovery rate the linear model should be updated to account for both the patients’ gene expression as well as the number of patients (note the previous model just focused on the patients’ gene expression).This could help us to control for patient variability.
However, I received the message : Error in .ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim, : No residual degrees of freedom in linear model fits when I tried to fit the model below using empirical Bayes as I don’t have replicated samples (Bioconductor Support, 2016). Controlling for patient variability through multiple testing does not apply to this dataset. Therefore the previous model using the Benjamni - Hochberg correction method is sufficient.
modelDesignPat <- model.matrix(~ metainfo$title + metainfo$status)
modelDesignPat[1:10,1:5]
## (Intercept) metainfo$titleCase_10 metainfo$titleCase_100
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## 7 1 0 0
## 8 1 0 0
## 9 1 0 0
## 10 1 1 0
## metainfo$titleCase_101 metainfo$titleCase_102
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 0 0
## 10 0 0
EdgeR Analysis
I looked at how the “Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study” paper performed differential expression analysis to get a better idea of where to go from here. The differential expression analysis they did was done using “DESeq252 software based on negative binomial generalized linear models, adjusting for age, race and the five cell type proportions (CD8T, CD4T, natural killer, Bcell, monocytes) in discovery and replication cohorts, respectively”.
This gave me reason to believe that I should not be using the Limma package, as it works best with data that has an underlying linear distribution.I decided to try EdgeR (Robinson et al., 2010), a package for data that works best when the data follows a negative binomial distribution.
A negative binomial distribution has a specific shape when plotted, so I can easily verify that my data is not following a linear distribution by seeing if it adheres to the shape of the curve of a negative binomial distribution.
|
|
|
|
Using this new underlying distribution assumption, I want to see if I have any significantly differentially expressed genes now:
## [1] 2226
## [1] 32
Excitingly, I have 2226 genes pass the threshhold p value and 32 pass FDR (Benjamini Hochberg) correction.
Now that I have some differently expressed genes, let’s view these bad boys in a volcano plot!
Volcano Plot
Let’s take a look at upregulated genes.
There are not a lot of upregulated genes (clearly… there’s only two red dots), so let’s see what the volcano plot looks like with downregulated genes.
Seems to be a bit more interesting.
Top Hits
Now I want to make another heatmap and see if there is any clustering that is happening by PTSD: past, never, current condition. I will use ComplexHeatmap (Gu, 2016).
324 samples make the column names difficult to read for the heatmap, but they are ordered by “current”, “never” and “past” PTSD diagnosis as seen by Case_10, Case_14, Case_16, Case_19, Case_32. There doesn’t seem to be extremely obviously clustering by diagnosis, but in the bottom right corner of the heatmap there is a section that’s more blue and preceding it is a section that is more pale red, it’s also more pale red at the beginning.
Using limma 495 genes pass the threshhold p value and 0 genes pass correction. No values are therefore significant after correction.
Using EdgeR 2226 genes pass the threshhold p value and 32 genes pass FDR (BH) correction.
I chose the threshhold p value of 0.05 as it is accepted across scientific literature as significant. There is little benefit to being more stringent especially since I am doing multiple-hypothesis testing anyways.
See Multiple Hypothesis Testing. I used the Benjamini-Hochberg (BH) method, as it is less strict than Bonferroni but still accepted as robust. There were 32 genes that passed the FDR (BH) correction.
See Volcano Plot.
See Top Hits.
Let’s start by getting the upregulated and downregulated genes.
Getting Genes of Interest
I’ll be using EdgeR as that is the method that permitted me to find differentially expressed genes.
## [1] 71
## [1] 2155
There are 71 upregulated genes, and 2155 downregulated genes, which is pretty spicy. I want to get the EnsemblIDs to put them into the g:profiler webpage to find matches so the next step is to get the EnsemblIDs and save them into a text file.
Annotation Data
The annotation data sets I selected on g:profiler (Raudvere et al., 2019) GO biological process, Reactome, and WikiPathways, as they were the datasets we had used in a previous assignment and I felt comfortable interpreting them. The version I used was e102_eg49_p15_7a9b4d6.
Genesets Returned
I used the same significance thresholds as when I performed gene expression analysis (0.05) with multiple-hypothesis testing correction, i.e. with the FDR-corrected p values. This was done using the Benjamini-Hochberg method, and again, the threshhold for p values was 0.05, as this is the standard.
Up Regulation Results
GO biological pathways: T= 30, Q= 36 T∩Q= 4
Reactome: T= 191, Q= 26 T∩Q= 7
Wiki Pathways: T= 0, Q= 0 T∩Q=0
Down Regulation Results
GO biological pathways: T= 4027, Q= 1682 T∩Q= 582
Reactome: T= 234, Q= 1058 T∩Q= 63
Wiki Pathways: T= 162, Q= 800 T∩Q= 52
All Gene Results
GO biological pathways: T= 4027, Q= 1718 T∩Q= 591
Reactome: T= 234, Q= 1084 T∩Q= 63
Wiki Pathways: T= 162, Q= 826 T∩Q= 52
Screenshots of gProfiler results
Upregulated genes
(No Wiki Pathways Results)
The genesets returned top terms related to hydrogen peroxide catabolic process and interferon signaling, generally related to immune system and metabolic processes (Gongora et al., 1999). This suggests the upregulated genes are involved in antiviral defense/immunity.
Downregulated genes
The genesets returned top terms related to protein modification enzymes, chromatin modification enzymes and the EGF/EGFR signal pathway which is involved in growth, differentiation, migration, adhesion and cell survival (WikiPathways, 2020). This indicates the downregulated genes are involved in cell growth and differentiation.
All Genes
The top term genesets for all genes were dominated by the downregulated gene results, again related to cell differentiation/survival.
I chose EdgeR as it had previously returned differentiated genes in the dataset, and the undelying assumptions of the data distributions matched this dataset best.
See Annotation Data.
See Genesets Returned.
See Genesets Returned. The results for all genes together were overall dominated by the the down-regulated genes, putting the genes together only returned 9 more genes in the intersection of T and Q. The terms returned were related to the same overarching themes as for the downregulated genes. However, there’s a clear distinction in term results between up-regulated and down-regulated genes (immune response/antiviral defense and cell differentiation/growth respectively).
See Genesets Returned.
The paper discussed they had found enrichment in “glucocorticoid receptor signaling and immunity-related pathways” but mentioned that these pathways were not robust to FDR correction. The paper also mentioned that gene FKBP5, with EnsembleID ENSG00000096060 was upregulated in patients with PTSD, which is found in the downregulated genes result of my analysis (downregulated in non-PTSD subjects as compared to those with PTSD). The glucocorticoid receptor pathway is a component of the immune response and we can see these immune response pathways come up in my analysis as well (upregulated gene pathways, which would be downregulated in PTSD patients).
There is a lot of evidence supporting immune dysregulation in PTSD patients, which supports the results found in my upregulated gene pathway analysis. For example, the paper “Inflammatory markers in post-traumatic stress disorder: a systematic review, meta-analysis, and meta-regression” showed increased inflammation in subjects with PTSD compared to healthy controls (Passos et al., 2015). Another poignant example of this finding was the paper “Posttraumatic stress disorder and physical illness: results from clinical and epidemiologic studies” which compiled evidence to show that patients with PTSD may be at risk for autoimmune diseases (Boscarino, 2004).
Bioconductor Support. (2016). ERROR: No residual degrees of freedom in linear model fits. Retrieved March 15, 2021, from https://support.bioconductor.org/p/59168/
Boscarino JA (2004). Posttraumatic stress disorder and physical illness: results from clinical and epidemiologic studies. Ann N Y Acad Sci.1032:141-53. doi: 10.1196/annals.1314.011. PMID: 15677401.
Gongora C, Mechti N. L’interféron: un mécanisme complexe de signalisation [Interferon signaling pathways] (1999). Bull Cancer. 86(11):911-9. French. PMID: 10586107.
Gu, Z. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.
Kuan, P., Waszczuk, M. A., Kotov, R., Clouston, S., Yang, X., Singh, P. K., . . . Luft, B. J. (2017). Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study. Translational Psychiatry, 7(12). doi:10.1038/s41398-017-0050-1
Passos IC, Vasconcelos-Moreno MP, Costa LG, Kunz M, Brietzke E, Quevedo J, Salum G, Magalhães PV, Kapczinski F, Kauer-Sant’Anna M (2015). Inflammatory markers in post-traumatic stress disorder: a systematic review, meta-analysis, and meta-regression. Lancet Psychiatry, 2(11):1002-12. doi: 10.1016/S2
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
Uku Raudvere, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, Jaak Vilo: g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Nucleic Acids Research 2019; doi:10.1093/nar/gkz369 [PDF].
Wiki Pathways. (2020, June 30). EGF/EGFR Signaling Pathway (Homo sapiens). Retrieved from https://www.wikipathways.org/index.php/Pathway:WP437